NSF PAR Search | NSF Public Access Repository

Learning Factorized Multimodal Representations

Tsai, Yao-Hung Hubert; Liang, Paul Pu; Zadeh, Amir; Morency, Louis-Philippe; Salakhutdinov, Ruslan (February 2019, International Conference on Learning Representations)

Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.
Full Text Available
Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph

Tsai, Yao-Hung Hubert; Divvala, Santosh; Morency, Louis-Philippe; Salakhutdinov, Ruslan; Farhadi, Ali (January 2019, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition)

Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts. For example, a relationship 'man, open, door' involves a complex relation 'open' between concrete entities 'man, door'. While much of the existing work has studied this problem in the context of still images, understanding visual relationships in videos has received limited attention. Due to their temporal nature, videos enable us to model and reason about a more comprehensive set of visual relationships, such as those requiring multiple (temporal) observations (e.g., 'man, lift up, box' vs. 'man, put down, box'), as well as relationships that are often correlated through time (e.g., 'woman, pay, money' followed by 'woman, buy, coffee'). In this paper, we construct a Conditional Random Field on a fully-connected spatio-temporal graph that exploits the statistical dependency between relational entities spatially and temporally. We introduce a novel gated energy function parametrization that learns adaptive relations conditioned on visual observations. Our model optimization is computationally efficient, and its space computation complexity is significantly amortized through our proposed parameterization. Experimental results on benchmark video datasets (ImageNet Video and Charades) demonstrate state-of-the-art performance across three standard relationship reasoning tasks: Detection, Tagging, and Recognition.
Full Text Available
Transformer Dissection: An Unified Understanding for Transformer’s Attention via the Lens of Kernel

https://doi.org/10.18653/v1/D19-1443

Tsai, Yao-Hung Hubert; Bai, Shaojie; Yamada, Makoto; Morency, Louis-Philippe; Salakhutdinov, Ruslan (January 2019, Proceedings of the Conference on Empirical Methods in Natural Language Processing)

Full Text Available
Strong and Simple Baselines for Multimodal Utterance Embeddings

https://doi.org/10.18653/v1/N19-1267

Liang, Paul Pu; Lim, Yao Chong; Tsai, Yao-Hung Hubert; Salakhutdinov, Ruslan; Morency, Louis-Philippe (January 2019, Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

Full Text Available
Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization

https://doi.org/10.18653/v1/P19-1152

Liang, Paul Pu; Liu, Zhun; Tsai, Yao-Hung Hubert; Zhao, Qibin; Salakhutdinov, Ruslan; Morency, Louis-Philippe (January 2019, Proceedings of the Annual Meeting of the Association for Computational Linguistics)

Full Text Available
Multimodal Transformer for Unaligned Multimodal Language Sequences

https://doi.org/10.18653/v1/P19-1656

Tsai, Yao-Hung Hubert; Bai, Shaojie; Liang, Paul Pu; Kolter, J. Zico; Morency, Louis-Philippe; Salakhutdinov, Ruslan (January 2019, Proceedings of the Annual Meeting of the Association for Computational Linguistics)

Full Text Available
